
Computer Engineering, 2018, Vol. 44, Issue (10): 22-27. doi: 10.19678/j.issn.1000-3428.0051189

Special Issue: Machine Learning


Detection Method for Hidden Hyperlink Based on Machine Learning

ZHOU Wenyi^a, GU Xubo^b, SHI Yong^a, XUE Zhi^a

  1. a. School of Cyber Security; b. School of Mechanical Engineering, Shanghai Jiaotong University, Shanghai 200240, China
  • Received: 2018-04-12 Online: 2018-10-15 Published: 2018-10-15

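  • About the authors: ZHOU Wenyi (born 1994), female, master's student, main research interests: network security, machine learning, and data mining; GU Xubo, master's student; SHI Yong, lecturer; XUE Zhi, professor.
  • Funding:

    Key Program of the National Natural Science Foundation of China (61332010).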

Abstract:

In the era of big data, traditional hidden hyperlink detection techniques cannot quickly and accurately identify, among massive numbers of Web pages, the websites that have suffered “hidden hyperlink attacks”. To address this problem, this paper introduces machine learning into hidden hyperlink detection. The proposed method combines three kinds of features: the domain names of hidden hyperlinks, the text related to them, and the structure used to hide them. Three detection models are built with Classification and Regression Tree (CART), Gradient Boosted Decision Tree (GBDT), and Random Forest (RF), and their performance is compared. Experimental results show that the proposed method has high accuracy and reliability, and the classification accuracy of the detection model built with RF can reach 0.984.

Key words: hidden hyperlink, feature extraction, cross validation, Classification and Regression Tree (CART), Random Forest (RF), Gradient Boosted Decision Tree (GBDT)
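
The abstract names the "hidden structure" of a hidden hyperlink as one of three feature groups but does not list the concrete features. The following is only a minimal sketch of what such features might look like, not the authors' implementation. The inline-style patterns in `HIDING_PATTERNS`, the helper `hidden_structure_features`, and the returned feature names are all illustrative assumptions, and the parser assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed (hypothetical) inline-style patterns that conceal content from visitors.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none"),
    re.compile(r"visibility\s*:\s*hidden"),
    re.compile(r"font-size\s*:\s*0(px|pt|em)?\s*(;|$)"),
    re.compile(r"(left|top|text-indent)\s*:\s*-\d{3,}"),
]

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_hiding_style(style):
    """Return True if an inline style string matches any assumed hiding pattern."""
    style = style.lower()
    return any(p.search(style) for p in HIDING_PATTERNS)


class HiddenLinkExtractor(HTMLParser):
    """Count links that sit inside (or are themselves) elements styled as hidden."""

    def __init__(self, page_domain):
        super().__init__()
        self.page_domain = page_domain
        self.stack = []  # (tag, is_hidden) for each currently open element
        self.total_links = 0
        self.hidden_links = 0
        self.hidden_external_links = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        hidden = is_hiding_style(attrs.get("style") or "")
        inside_hidden = hidden or any(h for _, h in self.stack)
        if tag == "a" and attrs.get("href"):
            self.total_links += 1
            if inside_hidden:
                self.hidden_links += 1
                host = urlparse(attrs["href"]).netloc
                if host and host != self.page_domain:
                    self.hidden_external_links += 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def hidden_structure_features(html, page_domain):
    """Return a small dict of illustrative structure features for one page."""
    parser = HiddenLinkExtractor(page_domain)
    parser.feed(html)
    parser.close()
    total = parser.total_links
    return {
        "total_links": total,
        "hidden_links": parser.hidden_links,
        "hidden_external_links": parser.hidden_external_links,
        "hidden_ratio": parser.hidden_links / total if total else 0.0,
    }
```

Feature dictionaries of this kind, combined with domain and text features, could then be converted to numeric vectors and passed to a tree-based classifier such as CART, GBDT, or RF for comparison under cross validation.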

